A probabilistic framework for aligning paired-end RNA-seq data
Abstract
MOTIVATION: The RNA-seq paired-end read (PER) protocol samples transcript fragments longer than the sequencing capability of today's technology by sequencing just the two ends of each fragment. Deep sampling of the transcriptome using the PER protocol presents the opportunity to reconstruct the unsequenced portion of each transcript fragment using end reads from overlapping PERs, guided by the expected length of the fragment.
METHODS: A probabilistic framework is described to predict the alignment to the genome of all PER transcript fragments in a PER dataset. Starting from possible exonic and spliced alignments of all end reads, the method constructs potential splicing paths connecting paired ends. An expectation maximization (EM) method assigns likelihood values to all splice junctions and assigns the most probable alignment to each transcript fragment.
RESULTS: The method was applied to 2 × 35 bp PER datasets from the cancer cell lines MCF-7 and SUM-102. PER fragment alignment increased coverage 3-fold compared with alignment of the end reads alone, and increased the accuracy of splice detection. The accuracy of the EM algorithm in the presence of alternative paths in the splice graph was validated by qRT-PCR experiments on eight exon-skipping alternative splicing events. PER fragment alignment with long-range splicing confirmed 8 out of 10 fusion events identified in the MCF-7 cell line in an earlier study by Maher et al. (2009).
AVAILABILITY: Software available at http://www.netlab.uky.edu/p/bioinfo/MapSplice/PER.
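The EM step described in the abstract can be illustrated with a toy sketch. This is not the published model or the MapSplice code: it assumes a simple mixture over candidate splicing paths, a Gaussian fragment-length likelihood with made-up `mean` and `sd` values, and hypothetical names (`length_likelihood`, `em_assign`, junction labels such as `"e1-e2"`). Each fragment is assumed to have at least one candidate path.

```python
import math
from collections import defaultdict


def length_likelihood(length, mean=250.0, sd=30.0):
    """Gaussian density of an implied fragment length (mean and sd are assumed)."""
    z = (length - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def em_assign(fragments, n_iter=50):
    """Toy EM over candidate splicing paths.

    fragments: list of candidate lists, one per fragment. Each candidate is
    (path, implied_length), where path is a tuple of junction labels
    (an empty tuple stands for an unspliced alignment).
    Returns (index of the most probable candidate per fragment,
             expected fragment support per junction).
    """
    paths = {path for cands in fragments for path, _ in cands}
    weight = {p: 1.0 / len(paths) for p in paths}  # uniform start
    expected = {}
    for _ in range(n_iter):
        # E-step: posterior responsibility of each candidate path per fragment.
        expected = defaultdict(float)
        for cands in fragments:
            scores = [weight[p] * length_likelihood(n) for p, n in cands]
            total = sum(scores)
            if total > 0.0:
                for (p, _), s in zip(cands, scores):
                    expected[p] += s / total
        # M-step: path weights become normalized expected counts.
        norm = sum(expected.values())
        if norm == 0.0:
            break
        weight = {p: expected[p] / norm for p in paths}

    # Pick the highest-posterior candidate for every fragment.
    best = []
    for cands in fragments:
        scores = [weight[p] * length_likelihood(n) for p, n in cands]
        best.append(max(range(len(cands)), key=scores.__getitem__))

    # Junction support: expected number of fragments whose path uses it.
    support = defaultdict(float)
    for p, count in expected.items():
        for junction in p:
            support[junction] += count
    return best, dict(support)


# Two fragments support junction e1-e2 unambiguously; the third is ambiguous.
fragments = [
    [(("e1-e2",), 250)],
    [(("e1-e2",), 240)],
    [(("e1-e2",), 250), (("e1-e3",), 250)],
]
best, support = em_assign(fragments)
```

In this sketch the ambiguous third fragment is pulled toward the junction that the unambiguous fragments already support, which is the general behaviour one would expect from an EM assignment when alternative paths exist in the splice graph.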
Similar resources
Discovering chimeric transcripts in paired-end RNA-seq data by using EricScript
MOTIVATION The discovery of novel gene fusions can lead to a better comprehension of cancer progression and development. The emergence of deep sequencing of the transcriptome, known as RNA-seq, has opened many opportunities for the identification of this class of genomic alterations, leading to the discovery of novel chimeric transcripts in melanomas, breast cancers and lymphomas. Nowadays, few comp...
Statistical Modeling of RNA-Seq Data.
Recently, ultra high-throughput sequencing of RNA (RNA-Seq) has been developed as an approach for analysis of gene expression. By obtaining tens or even hundreds of millions of reads of transcribed sequences, an RNA-Seq experiment can offer a comprehensive survey of the population of genes (transcripts) in any sample of interest. This paper introduces a statistical model for estimating isoform ...
Learning Probabilistic Splice Graphs from RNA-Seq data
RNA-Seq technology provides the foundation for accurately measuring gene expression levels when paired with a model for mapping the produced sequencing reads to a reference genome. Because of the shorter length of RNA-Seq reads, a single read is not uniquely mapped to a single location in the genome and requires probabilistic treatment to accurately measure relative expression levels. We presen...
Analysis of paired end Pol II ChIP-seq and short capped RNA-seq in MCF-7 cells
While a role of promoter-proximal RNA Polymerase II (Pol II) pausing in regulation of eukaryotic gene expression is implied, the mechanisms and dynamics of this process are poorly understood. We performed genome-wide analysis of short capped RNAs (scRNAs) and Pol II chromatin immunoprecipitation sequencing (ChIP-seq) in human breast cancer MCF-7 cells to better understand Pol II pausing (Samara...
FusionMap: detecting fusion genes from next-generation sequencing data at base-pair resolution
MOTIVATION Next generation sequencing technology generates high-throughput data, which allows us to detect fusion genes at both transcript and genomic levels. To detect fusion genes, the current bioinformatics tools heavily rely on paired-end approaches and overlook the importance of reads that span fusion junctions. Thus there is a need to develop an efficient aligner to detect fusion events b...
Journal:
- Bioinformatics
Volume 26, Issue 16
Pages -
Publication date 2010